featauR: Automated Feature Selection for Machine Learning Algorithms

Lukas Jan Stroemsdoerfer
Data Scientist @STATWORX

About STATWORX

We are a consulting company for data science, machine learning and statistics with offices in Frankfurt, Zurich and Stuttgart. We support our customers in the development and implementation of data science and machine learning solutions.

Our clients:
Our expertise:

About Our Workflow

Data science projects often follow a similar structure: at the very beginning, one must load and prep the data, of course. Everything afterwards is fun; the first two parts are not.

About Feature Selection: Problem

Feature Selection is one of the most fundamental tasks in the data science workflow:

  • Why do we need feature selection?
    • Features are all the information you have to predict your target
    • With too many features, most models either cannot be identified or become noisy
    • The right combination of features boosts performance, not the algorithm
  • How can we select features?
    • Manually testing all feature set combinations (not cool)
    • Use bivariate metrics like correlations to filter for relevant features (a little cooler)
    • Loop over all feature set combinations and evaluate them based on performance gain (a bit cooler)
    • Use our package (super cool #nobias)
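The "a little cooler" filtering approach can be sketched in a few lines of R: rank all features by their absolute correlation with the target and keep the strongest ones. The function and argument names below are illustrative, not part of any package.

```r
# Minimal correlation filter: keep the n_keep features most strongly
# (absolutely) correlated with the target column. Purely illustrative.
cor_filter <- function(df, target, n_keep = 10) {
  features <- setdiff(names(df), target)
  cors <- sapply(features, function(f) abs(cor(df[[f]], df[[target]])))
  names(sort(cors, decreasing = TRUE))[seq_len(min(n_keep, length(cors)))]
}

# e.g. keep the 10 features most correlated with "target"
# selected <- cor_filter(train_df, target = "target", n_keep = 10)
```

Note that such bivariate filters are cheap but blind to interactions, which is exactly the gap the wrapper approaches below try to close.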

About Feature Selection: Solutions?

Currently there are two main ways to select the relevant features out of the entire feature space: filter methods, which rank features by bivariate metrics, and wrapper methods, which evaluate candidate feature sets by model performance.

About Our Idea

About Componentwise-Boosting

Componentwise Gradient Boosting is a boosting ensemble algorithm that allows discriminating the relevance of features. In essence, the method follows this algorithm:
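The idea can be tried out directly with the mboost package (not part of bounceR): componentwise boosting fits one base-learner per feature and, in each of mstop iterations, updates only the single feature that best reduces the loss, so features that are never picked keep a zero coefficient. `train_df` and `"target"` are placeholders here.

```r
# Componentwise (model-based) gradient boosting via mboost.
# mstop and nu mirror the boostingControl defaults shown later.
library(mboost)

model <- glmboost(target ~ ., data = train_df,
                  control = boost_control(mstop = 100, nu = 0.1))

# features with non-zero coefficients were selected in at least one iteration
coef(model)
```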

About Our Package: Installation

The package is still under development and not yet listed on CRAN. However, you can get it from GitHub.

# load devtools
install.packages("devtools")
library(devtools)

# download from our public repo
devtools::install_github("STATWORX/bounceR")

# source it
library(bounceR)

If you find any bugs or spot anything that is not super convenient, just open an issue.

About Our Package: Content

The package contains a variety of useful functions surrounding the topic of feature selection, such as:

  • Convenience:
    • sim_data: a function simulating regression and classification data, where the true feature space is known
  • Filtering:
    • featureFiltering: a function implementing several popular filter methods for feature selection
  • Wrapper:
    • featureSelection: a function implementing our home grown algorithm for feature selection
  • Methods:
    • print.sel_obj: an S3 printing method for the object class “sel_obj”
    • plot.sel_obj: an S3 plotting method for the object class “sel_obj”
    • summary.sel_obj: an S3 summary method for the object class “sel_obj”
    • builder: method to extract a formula with n features from a “sel_obj”
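A typical session with these functions could look as follows; since the exact signatures are not shown on this slide, the argument names here are assumptions, not the documented API.

```r
library(bounceR)

# simulate a data set where the true feature space is known
# (arguments to sim_data are assumed, check ?sim_data for the real interface)
df <- sim_data()

# filter-based pre-selection on that data
# (argument names are assumed as well)
filtered <- featureFiltering(data = df, target = "y")
```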

About Our Algorithm: Idea

Each round a random feature importance distribution is initialized. Over the course of \( m \) models, the distribution is adjusted. Essentially our code follows the algorithm:
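The update idea can be sketched as follows; this is not bounceR's actual code, and the scoring rule and all names are illustrative. Start from uniform feature weights, repeatedly sample a subset proportional to the weights, fit a small model, and nudge the weights of the sampled features up (reward) or down (penalty) depending on whether the model beat the running average.

```r
# Illustrative weight-update loop behind the idea above.
# reward/penalty mirror the selectionControl parameters shown later.
select_by_weights <- function(X, y, n_mods = 1000, p = 5,
                              reward = 0.2, penalty = 0.3) {
  w <- rep(1, ncol(X)); names(w) <- colnames(X)
  avg_score <- NULL
  for (i in seq_len(n_mods)) {
    idx <- sample(ncol(X), p, prob = w / sum(w))
    fit <- lm(y ~ ., data = data.frame(y = y, X[, idx, drop = FALSE]))
    score <- summary(fit)$adj.r.squared
    if (is.null(avg_score)) avg_score <- score
    w[idx] <- w[idx] * if (score > avg_score) (1 + reward) else (1 - penalty)
    avg_score <- (avg_score * (i - 1) + score) / i
  }
  sort(w, decreasing = TRUE)  # higher weight = more relevant feature
}
```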

About Our Algorithm: Pseudo

Essentially we take bits from cool algorithms and put them together. For one, we leverage the complete randomness of random forests. Additionally, we apply a somewhat transformed idea of backpropagation.

About Our Algorithm: Usage

Sure, you have a lot of tuning parameters, but we put them all together in a nice and handy little interface. By the way, we set the defaults based on several simulation studies, so you can, sort of, trust them, sometimes.

# Feature Selection using bounceR-----------------------------------------------------
selection <- featureSelection(data = train_df,                                      
                              target = "target",
                              index = NULL,
                              selection = selectionControl(n_rounds = 100,
                                                           n_mods = 1000,
                                                           p = NULL,
                                                           reward = 0.2,
                                                           penalty = 0.3,
                                                           max_features = NULL),
                              bootstrap = "regular",
                              boosting = boostingControl(mstop = 100, nu = 0.1),
                              early_stopping = "aic",
                              n_cores = 6)
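Once the run finishes, the methods listed earlier can be applied to the returned “sel_obj”; note that the `n` argument to `builder` is an assumption based on the slide's description ("extract a formula with n features"), not a documented signature.

```r
# inspect the selection result via the sel_obj methods
summary(selection)                   # overview of the selection run
plot(selection)                      # visualize the result

# extract a model formula with the top 10 features (argument name assumed)
form <- builder(selection, n = 10)
```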

About The End

If you have any questions, are interested, or have an idea, just contact us!

Our Package: